-
-
Notifications
You must be signed in to change notification settings - Fork 150
feat(vectors): making vector similarity configurable #436
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds configurable vector similarity methods to the helix-db vector engine. Previously, the system was limited to cosine similarity with feature flags, but now supports configurable similarity methods including cosine distance, cosine similarity, and Euclidean distance.
- Introduces a
SimilarityMethodenum with three variants: CosineDistance, CosineSimilarity (default), and EuclideanDistance - Updates the vector distance calculation API to accept similarity method parameters
- Removes the
cosinefeature flag and related conditional compilation directives
Reviewed Changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| helix-db/src/helix_engine/traversal_core/config.rs | Adds SimilarityMethod enum and integrates it into vector configuration |
| helix-db/src/helix_engine/vector_core/vector_distance.rs | Updates distance calculation to support multiple similarity methods |
| helix-db/src/helix_engine/vector_core/vector_core.rs | Integrates similarity method into VectorCore constructor and usage |
| helix-db/src/helix_engine/vector_core/vector.rs | Updates HVector distance_to method to accept similarity method parameter |
| helix-db/Cargo.toml | Removes cosine feature flag from feature definitions |
| Multiple test files | Updates test calls to use new distance calculation API with similarity method |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| let c = x - y; | ||
| c * c | ||
| }) | ||
| .sum()) |
Copilot
AI
Sep 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The euclidean_distance function returns the squared Euclidean distance, not the actual Euclidean distance. It should return the square root of the sum: Ok(from.iter().zip(to.iter()).map(|(x, y)| { let c = x - y; c * c }).sum::<f64>().sqrt())
| .sum()) | |
| .sum::<f64>() | |
| .sqrt()) |
| pub mcp: Option<bool>, | ||
| pub bm25: Option<bool>, | ||
| pub schema: Option<String>, | ||
| pub embedding_model: Option<String>, |
Copilot
AI
Sep 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removal of graphvis_node_label field appears throughout multiple files but the field is still referenced in line 64. This creates an inconsistent state where the field exists in some structs but not others.
…fferent return configurations, wrapped in a single enum that implements different ordering based on similarity calculation method
…ods, v from id/type (#438) ## Description <!-- Provide a brief description of the changes in this PR --> ## Related Issues <!-- Link to any related issues using #issue_number --> Closes #433 #436 #435 ## Checklist when merging to main <!-- Mark items with "x" when completed --> - [ ] No compiler warnings (if applicable) - [ ] Code is formatted with `rustfmt` - [ ] No useless or dead code (if applicable) - [ ] Code is easy to understand - [ ] Doc comments are used for all functions, enums, structs, and fields (where appropriate) - [ ] All tests pass - [ ] Performance has not regressed (assuming change was not to fix a bug) - [ ] Version number has been updated in `helix-cli/Cargo.toml` and `helixdb/Cargo.toml` ## Additional Notes <!-- Add any additional information that would be helpful for reviewers -->
## Description <!-- Provide a brief description of the changes in this PR --> ## Related Issues <!-- Link to any related issues using #issue_number --> Closes #670 #666 #667 #672 #668 #661 #655 #654 #652 #436 ## Checklist when merging to main <!-- Mark items with "x" when completed --> - [ ] No compiler warnings (if applicable) - [ ] Code is formatted with `rustfmt` - [ ] No useless or dead code (if applicable) - [ ] Code is easy to understand - [ ] Doc comments are used for all functions, enums, structs, and fields (where appropriate) - [ ] All tests pass - [ ] Performance has not regressed (assuming change was not to fix a bug) - [ ] Version number has been updated in `helix-cli/Cargo.toml` and `helixdb/Cargo.toml` ## Additional Notes <!-- Add any additional information that would be helpful for reviewers --> <!-- greptile_comment --> <h2>Greptile Overview</h2> Updated On: 2025-11-07 00:19:04 UTC <h3>Greptile Summary</h3> This PR implements arena-based memory allocation for graph traversals and refactors the worker pool's channel selection mechanism. **Key Changes:** - **Arena Implementation**: Introduced `'arena` lifetime parameter throughout traversal operations (`in_e.rs`), replacing owned data with arena-allocated references for improved memory efficiency - **Worker Pool Refactor**: Replaced `flume::Selector` with a parity-based `try_recv()`/`recv()` pattern to handle two channels (`cont_rx` and `rx`) across multiple worker threads - **Badge Addition**: Added Manta Graph badge to README **Issues Found:** - **Worker Pool Channel Handling**: The new parity-based approach requires an even number of workers (≥2) and uses non-blocking `try_recv()` followed by blocking `recv()` on alternating channels. While this avoids a true busy-wait (since one `recv()` always blocks), the asymmetry means channels are polled at different frequencies, potentially causing channel starvation or unfair scheduling compared to the previous `Selector::wait()` approach. The arena implementation appears solid and follows Rust lifetime best practices. The worker pool change seems to be addressing a specific issue with core affinity (per commit `7437cf0f`), but the trade-off in channel fairness should be monitored. <details><summary><h3>Important Files Changed</h3></summary> File Analysis | Filename | Score | Overview | |----------|-------|----------| | README.md | 5/5 | Added Manta Graph badge to README - cosmetic documentation change with no functional impact | | helix-db/src/helix_engine/traversal_core/ops/in_/in_e.rs | 5/5 | Refactored to use arena-based lifetimes ('arena) instead of owned data, replacing separate InEdgesIterator struct with inline closures for better memory management | | helix-db/src/helix_gateway/worker_pool/mod.rs | 3/5 | Replaced flume Selector with parity-based try_recv/recv pattern requiring even worker count, but implementation has potential busy-wait issues that could cause high CPU usage | </details> </details> <details><summary><h3>Sequence Diagram</h3></summary> ```mermaid sequenceDiagram participant Client participant WorkerPool participant Worker1 as Worker (parity=true) participant Worker2 as Worker (parity=false) participant Router participant Storage Client->>WorkerPool: process(request) WorkerPool->>WorkerPool: Send request to req_rx channel par Worker1 Loop (parity=true) loop Every iteration Worker1->>Worker1: try_recv(cont_rx) - non-blocking alt Continuation available Worker1->>Worker1: Execute continuation function else Empty Worker1->>Worker1: Skip (no busy wait here) end Worker1->>Worker1: recv(rx) - BLOCKS until request alt Request received Worker1->>Router: Route request to handler Router->>Storage: Execute graph operation Storage-->>Router: Return result Router-->>Worker1: Response Worker1->>WorkerPool: Send response via ret_chan end end end par Worker2 Loop (parity=false) loop Every iteration Worker2->>Worker2: try_recv(rx) - non-blocking alt Request available Worker2->>Router: Route request to handler Router->>Storage: Execute graph operation Storage-->>Router: Return result Router-->>Worker2: Response Worker2->>WorkerPool: Send response via ret_chan else Empty Worker2->>Worker2: Skip (no busy wait here) end Worker2->>Worker2: recv(cont_rx) - BLOCKS until continuation alt Continuation received Worker2->>Worker2: Execute continuation function end end end WorkerPool-->>Client: Response ``` </details> <!-- greptile_other_comments_section --> <!-- /greptile_comment -->
Description
Related Issues
Closes #
Checklist when merging to main
rustfmthelix-cli/Cargo.tomlandhelixdb/Cargo.tomlAdditional Notes